Extracting Multiwords From Large Document Collection Based N-Gram
Author
Abstract
Multiword terms (MWTs) are relevant strings of words in text collections. Once extracted automatically, they can be used by an information retrieval system to suggest conceptually interesting refinements of its users' information needs. These multiword terms point to relevant information, often corresponding to topics and subtopics in the text collection, and may be quite useful, especially for refining generic queries. A new approach is proposed for finding collocations in text documents. A collocation is a set of words that occur together in a corpus more often than would be expected by chance. Collocations are extracted based on the frequency of the joint occurrence of the words as well as the frequency of the individual occurrences of each word in the whole text. Intuitively, when a set of words is extracted as a collocation, the joint occurrence of the words must be high in comparison to that of the constituent individual words.
Keywords: Multiword terms (MWTs), Information, Collocations, Extraction, Text Document.
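The abstract does not name the exact association measure used, so the following is only a minimal sketch of the general idea: compare how often two words occur together against how often each occurs on its own. It assumes pointwise mutual information (PMI) over adjacent word pairs as a stand-in measure; the function name `score_bigrams` and the `min_count` threshold are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def score_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by PMI (an assumed stand-in measure).

    A high score means the pair occurs together more often than the
    individual word frequencies alone would predict.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), joint in bigrams.items():
        # Skip rare pairs: PMI is unreliable for very low counts.
        if joint < min_count:
            continue
        p_joint = joint / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))

    # Highest-scoring candidate collocations first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    text = "new york is big and new york is old and the city is new"
    for pair, score in score_bigrams(text.split()):
        print(pair, round(score, 3))
```

The same comparison extends to longer n-grams by counting n-word windows instead of pairs, though the choice of measure and thresholds would need to follow the paper's full text.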
Similar resources
Multiword Frequency Analysis Based on MEDLINE N-gram Set
Multiwords are vital to better precision and recall in NLP applications. The Lexical Systems Group (LSG) developed an effective approach to add multiwords to the SPECIALIST Lexicon from the MEDLINE n-gram set. This paper describes a frequency analysis on LexMultiwords (LMWs) and acronym expansions based on the word count (WC) in MEDLINE. Results show most LMWs locate in the low WC range with be...
Generating Multiwords from MEDLINE in the SPECIALIST Lexicon
Multiwords are vital to better NLP systems for more effective and efficient parsers, refining information retrieval searches, enhancing precision and recall in NLP applications, etc. The Lexical Systems Group (LSG) enhanced the coverage of multiwords in the Lexicon to provide a more comprehensive resource. This paper describes a new systematic approach to lexical multiword acquisition from MEDL...
Generating a Distilled N-Gram Set - Effective Lexical Multiword Building in the SPECIALIST Lexicon
Multiwords are vital to better Natural Language Processing (NLP) systems for more effective and efficient parsers, refining information retrieval searches, enhancing precision and recall in Medical Language Processing (MLP) applications, etc. The Lexical Systems Group has enhanced the coverage of multiwords in the Lexicon to provide a more comprehensive resource for such applications. This pape...
Generating the MEDLINE N-Gram Set
The MEDLINE n-gram set is a very useful resource in Natural Language Processing (NLP) and Medical Language Processing (MLP). Currently, there is no MEDLINE n-gram set available in the public domain. Due to the large scale of data, it is a challenge to generate MEDLINE n-grams to fit into a research schedule with limited computer resources. The Lexical System Group (LSG) developed an algorithm t...
Information Extraction from Web-Scale N-Gram Data
Search engines are increasingly relying on structured data to provide direct answers to certain types of queries. However, extracting such structured data from text is challenging, especially due to the scarcity of explicitly expressed knowledge. Even when relying on large document collections, pattern-based information extraction approaches typically expose only insufficient amounts of informa...
Publication date: 2013